Applying auxiliary supervised depth-assisted transformer and cross modal attention fusion in monocular 3D object detection

Bibliographic Details
Main Authors: Zhijian Wang, Jie Liu, Yixiao Sun, Xiang Zhou, Boyan Sun, Dehong Kong, Jay Xu, Xiaoping Yue, Wenyu Zhang
Format: Article
Language: English
Published: PeerJ Inc. 2025-01-01
Series: PeerJ Computer Science
Subjects: 3D object detection; Transformer; Depth estimation
Online Access: https://peerj.com/articles/cs-2656.pdf
author Zhijian Wang
Jie Liu
Yixiao Sun
Xiang Zhou
Boyan Sun
Dehong Kong
Jay Xu
Xiaoping Yue
Wenyu Zhang
author_sort Zhijian Wang
collection DOAJ
description Monocular 3D object detection is one of the most widely applied yet challenging approaches for autonomous driving, because 2D images lack explicit 3D information. Existing methods are limited by inaccurate depth estimates arising from inequivalent supervision targets, and combining depth and visual features raises the further problem of fusing heterogeneous modalities. In this article, we propose the Depth Detection Transformer (Depth-DETR), which applies an auxiliary-supervised, depth-assisted transformer and cross-modal attention fusion to monocular 3D object detection. Depth-DETR introduces two depth encoders alongside the visual encoder; supervised by ground-truth depth and by bounding boxes respectively, the two encoders work independently, complementing each other's limitations and predicting target distances more accurately. Furthermore, Depth-DETR employs cross-modal attention to fuse the three resulting features: a parallel structure of two cross-modal transformers combines each depth feature with the visual features, and avoiding early fusion between the two depth features strengthens the final fused representation. Across multiple experimental validations, Depth-DETR achieves highly competitive results on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, with an AP score of 17.49, demonstrating its strong 3D object detection performance.
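
The description above is concrete enough to sketch the fusion step. Below is a minimal, hypothetical PyTorch sketch of the "parallel structure of two cross-modal transformers": visual tokens cross-attend separately to each depth stream, and the two streams are combined only afterwards (late fusion). Everything here is an illustrative assumption inferred from the abstract; the module name ParallelCrossModalFusion, the token shapes, and the 256-channel width are not taken from the authors' code.

```python
import torch
import torch.nn as nn


class ParallelCrossModalFusion(nn.Module):
    """Sketch of parallel cross-modal attention fusion (names hypothetical).

    The visual tokens act as queries into two independent depth streams:
    one from a depth encoder supervised by ground-truth depth, one from a
    depth encoder supervised by bounding boxes. Each stream gets its own
    cross-attention block, so the two depth features are never fused with
    each other early; they are concatenated and projected only at the end.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per depth stream (the parallel structure).
        self.attn_depth_gt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_depth_box = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_gt = nn.LayerNorm(dim)
        self.norm_box = nn.LayerNorm(dim)
        # Late fusion: concatenate the two attended streams, project back to dim.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual, depth_gt, depth_box):
        # visual:    (B, N, C) tokens from the visual encoder
        # depth_gt:  (B, M, C) tokens from the ground-truth-depth-supervised encoder
        # depth_box: (B, K, C) tokens from the bounding-box-supervised encoder
        a, _ = self.attn_depth_gt(visual, depth_gt, depth_gt)    # queries = visual
        b, _ = self.attn_depth_box(visual, depth_box, depth_box)
        a = self.norm_gt(visual + a)    # residual + norm, transformer-style
        b = self.norm_box(visual + b)
        return self.fuse(torch.cat([a, b], dim=-1))  # (B, N, C) fused feature


if __name__ == "__main__":
    fusion = ParallelCrossModalFusion()
    v = torch.randn(2, 100, 256)  # 2 images, 100 visual tokens
    out = fusion(v, torch.randn(2, 100, 256), torch.randn(2, 100, 256))
    print(out.shape)  # torch.Size([2, 100, 256])
```

Keeping the two attention blocks separate is what realizes the abstract's claim that early fusion between the two depth features is avoided: each stream contributes to the final representation without first being entangled with the other.
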
format Article
id doaj-art-0479e626ee194c418c737d4897e0fe9e
institution Kabale University
issn 2376-5992
language English
publishDate 2025-01-01
publisher PeerJ Inc.
record_format Article
series PeerJ Computer Science
doi 10.7717/peerj-cs.2656
container PeerJ Computer Science, vol. 11, article e2656 (2025-01-01)
affiliation Zhijian Wang: School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, Liaoning, China
affiliation Jie Liu: Anshan Power Supply Company, Liaoning Electric Power Limited Company of State Grid, Anshan, Liaoning, China
affiliation Yixiao Sun: School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, Liaoning, China
affiliation Xiang Zhou: Inner Mongolia Electronic Information Vocational Technical College, Huhehaote, Neimenggu, China
affiliation Boyan Sun: School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, Liaoning, China
affiliation Dehong Kong: School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, Liaoning, China
affiliation Jay Xu: School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, Liaoning, China
affiliation Xiaoping Yue: School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, Liaoning, China
affiliation Wenyu Zhang: School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, Liaoning, China
title Applying auxiliary supervised depth-assisted transformer and cross modal attention fusion in monocular 3D object detection
topic 3D object detection
Transformer
Depth estimation
url https://peerj.com/articles/cs-2656.pdf