Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach

Estimating the camera's pose given images from a single camera is a traditional task in mobile robots and autonomous vehicles. This problem is called monocular visual odometry, and it often relies on geometric approaches that require considerable engineering effort for a specific scenario. Deep learning methods have been shown to generalize well after proper training and with a large amount of available data. Transformer-based architectures have dominated the state of the art in natural language processing and in computer vision tasks such as image and video understanding. In this work, we treat monocular visual odometry as a video understanding task and estimate the 6 degrees of freedom of the camera's pose. We contribute the TSformer-VO model, which uses spatio-temporal self-attention mechanisms to extract features from clips and estimate motions in an end-to-end manner. Our approach achieves competitive performance compared with state-of-the-art geometry-based and deep learning-based methods on the KITTI visual odometry dataset, outperforming the DeepVO implementation widely accepted in the visual odometry community. The code is publicly available at https://github.com/aofrancani/TSformer-VO.
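
The abstract describes clip-level motion estimation with spatio-temporal self-attention. As a rough illustration of that idea only (this is not the authors' implementation; the class names, layer sizes, and hyperparameters below are invented for the sketch, and the real TSformer-VO code is in the linked repository), here is a minimal PyTorch model that patch-embeds a short clip, applies divided temporal-then-spatial self-attention in each block, and regresses one 6-DoF motion per consecutive frame pair:

```python
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    # One Transformer block with divided space-time attention, in the
    # spirit of TimeSformer: temporal self-attention across frames,
    # then spatial self-attention within each frame, then an MLP.
    def __init__(self, dim, heads, frames, patches):
        super().__init__()
        self.frames, self.patches = frames, patches
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (B, T*P, D), tokens ordered frame-major
        B, _, D = x.shape
        T, P = self.frames, self.patches
        # Temporal attention: tokens at the same spatial location
        # attend to each other across the T frames.
        xt = x.reshape(B, T, P, D).transpose(1, 2).reshape(B * P, T, D)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, P, T, D).transpose(1, 2).reshape(B, T * P, D)
        # Spatial attention: the P tokens of one frame attend to each other.
        xs = x.reshape(B * T, P, D)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(B, T * P, D)
        return x + self.mlp(self.norm_m(x))  # position-wise MLP


class ClipPoseRegressor(nn.Module):
    # Hypothetical end-to-end regressor: patch-embeds a clip of T RGB
    # frames and predicts one 6-DoF motion (3 translation + 3 rotation
    # parameters) per consecutive frame pair.
    def __init__(self, frames=2, img=224, patch=32, dim=128, depth=4):
        super().__init__()
        self.frames = frames
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, frames * n_patches, dim))
        self.blocks = nn.ModuleList(
            DividedSpaceTimeBlock(dim, 4, frames, n_patches)
            for _ in range(depth))
        self.head = nn.Linear(dim, 6 * (frames - 1))

    def forward(self, clip):  # clip: (B, T, 3, H, W)
        B, T, C, H, W = clip.shape
        x = self.embed(clip.reshape(B * T, C, H, W))  # (B*T, D, h, w)
        x = x.flatten(2).transpose(1, 2)              # (B*T, P, D)
        x = x.reshape(B, T * x.shape[1], -1) + self.pos
        for blk in self.blocks:
            x = blk(x)
        motions = self.head(x.mean(dim=1))  # mean-pool tokens, regress
        return motions.reshape(B, self.frames - 1, 6)


clip = torch.randn(1, 2, 3, 224, 224)   # one clip of two 224x224 frames
print(ClipPoseRegressor()(clip).shape)  # torch.Size([1, 1, 6])
```

The point of the divided design is cost: attending over the T frames and over the P patches separately is much cheaper than joint attention over all T*P tokens at once, which is what makes end-to-end clip-level processing tractable.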

Bibliographic Details
Main Authors: Andre O. Francani (ORCID: 0000-0001-6576-1132), Marcos R. O. A. Maximo (ORCID: 0000-0003-2944-4476)
Affiliation: Autonomous Computational Systems Laboratory, Aeronautics Institute of Technology, São José dos Campos, São Paulo, Brazil
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access, vol. 13, pp. 13959-13971, article no. 10845764
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2025.3531667
Subjects: Deep learning; monocular visual odometry; transformer; video understanding
Online Access: https://ieeexplore.ieee.org/document/10845764/