Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach

Bibliographic Details
Main Authors: Andre O. Francani, Marcos R. O. A. Maximo
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10845764/
Description
Summary: Estimating the camera's pose given images from a single camera is a traditional task in mobile robots and autonomous vehicles. This problem is called monocular visual odometry, and it often relies on geometric approaches that require considerable engineering effort for a specific scenario. Deep learning methods have been shown to generalize after proper training and with a large amount of available data. Transformer-based architectures have dominated the state of the art in natural language processing and computer vision tasks, such as image and video understanding. In this work, we treat monocular visual odometry as a video understanding task to estimate the 6 degrees of freedom of a camera's pose. We contribute by presenting the TSformer-VO model, based on spatio-temporal self-attention mechanisms, to extract features from clips and estimate the motions in an end-to-end manner. Our approach achieved performance competitive with state-of-the-art geometry-based and deep learning-based methods on the KITTI visual odometry dataset, outperforming the DeepVO implementation widely accepted in the visual odometry community. The code is publicly available at https://github.com/aofrancani/TSformer-VO.
ISSN: 2169-3536
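
The summary describes TSformer-VO as applying spatio-temporal self-attention to input clips and regressing 6-DoF motions end to end. As a rough illustration only, the sketch below shows one way such a model could be organized in PyTorch, assuming a TimeSformer-style divided space-time attention block and a pooled regression head that outputs one 6-DoF motion per consecutive frame pair. All module names, layer counts, and dimensions are assumptions made for this sketch, not the authors' implementation; see the linked repository for the actual TSformer-VO code.

```python
# Minimal sketch (not the authors' code) of divided space-time attention
# feeding a 6-DoF pose regression head, in the spirit of the summary above.
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """One block: temporal attention, then spatial attention, then an MLP."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) patch embeddings of a clip.
        b, f, p, d = x.shape
        # Temporal attention: each spatial patch attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, f, d)
        xt_n = self.norm_t(xt)
        xt = xt + self.temporal_attn(xt_n, xt_n, xt_n)[0]
        x = xt.reshape(b, p, f, d).permute(0, 2, 1, 3)
        # Spatial attention: each frame's patches attend to one another.
        xs = x.reshape(b * f, p, d)
        xs_n = self.norm_s(xs)
        xs = xs + self.spatial_attn(xs_n, xs_n, xs_n)[0]
        x = xs.reshape(b, f, p, d)
        return x + self.mlp(x)


class PoseRegressor(nn.Module):
    """Maps clip features to one 6-DoF motion per consecutive frame pair."""

    def __init__(self, dim: int, frames: int):
        super().__init__()
        # Depth of 2 blocks is an arbitrary choice for this sketch.
        self.blocks = nn.Sequential(*[DividedSpaceTimeBlock(dim, heads=8)
                                      for _ in range(2)])
        self.head = nn.Linear(dim, 6 * (frames - 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        feats = self.blocks(tokens)
        pooled = feats.mean(dim=(1, 2))  # global average over space-time
        return self.head(pooled).view(-1, tokens.shape[1] - 1, 6)


# Example: batch of 2 clips, 3 frames each, 196 patches of dimension 256.
poses = PoseRegressor(dim=256, frames=3)(torch.randn(2, 3, 196, 256))
print(poses.shape)  # torch.Size([2, 2, 6]) -> 6 DoF per frame pair
```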