Cross-Supervised LiDAR-Camera Fusion for 3D Object Detection


Bibliographic Details
Main Authors: Chao Jie Zuo, Cao Yu Gu, Yi Kun Guo, Xiao Dong Miao
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: 3D object detection; LiDAR-camera system; multi-sensor fusion; BEV
Online Access: https://ieeexplore.ieee.org/document/10804146/
collection DOAJ
description Fusing LiDAR and camera information is essential for accurate and reliable 3D object detection in autonomous driving systems. Because of the inherent differences between the two modalities, finding an efficient and accurate fusion method is of great importance. Recently, significant progress has been made in 3D object detection methods based on the lift-splat-shoot (LSS) paradigm. However, inaccurate depth estimation and substantial loss of semantic information remain the main factors limiting the accuracy of 3D detection. In this paper, we propose a cross-fusion framework under a dual spatial representation: it integrates information from two spatial representations, the bird's-eye view (BEV) and the camera view, and establishes soft links between them to fully exploit the information carried by each modality. The framework consists of two components, the gated LiDAR-supervised BEV (GLS-BEV) module and the multi-attention cross-fusion (MACF) module. The former achieves accurate depth estimation by projecting LiDAR data, which carries unambiguous depth, into the image space to supervise the view transformation, constructing point cloud features from the vehicle's perspective. The latter uses three sub-attention modules with distinct roles to achieve cross-modal interaction within the same space, effectively reducing semantic loss. On the nuScenes benchmark, the proposed method achieves outstanding 3D object detection results with 71.8 mAP and 74.2 NDS. The code is available at https://github.com/zcj223311/CSDSFusion.
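
The GLS-BEV module supervises the camera branch's depth estimation by transforming LiDAR returns, which carry unambiguous depth, into the image space. As a rough illustration of that general technique, the sketch below projects LiDAR points through a calibration matrix to build a sparse depth map, then supervises an LSS-style per-pixel depth distribution only where returns exist. Every function name, tensor shape, and the loss choice here is an assumption for illustration, not the authors' implementation (see their repository for the real code).

```python
# Minimal sketch of LiDAR-supervised depth for an LSS-style view transform.
# Names, shapes, and the loss choice are illustrative assumptions, not the
# paper's actual implementation.
import torch
import torch.nn.functional as F

def project_lidar_to_image(points, lidar2img, H, W):
    """Project LiDAR points (N, 3) into an (H, W) image plane and return a
    sparse depth map with 0 where no point lands.

    lidar2img: (4, 4) homogeneous LiDAR-to-image calibration matrix (assumed).
    """
    N = points.shape[0]
    pts_h = torch.cat([points, points.new_ones(N, 1)], dim=1)  # (N, 4)
    cam = (lidar2img @ pts_h.T).T                              # (N, 4)
    depth = cam[:, 2]
    valid = depth > 0.1                                        # in front of camera
    u = cam[:, 0] / depth.clamp(min=1e-5)
    v = cam[:, 1] / depth.clamp(min=1e-5)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth_map = points.new_zeros(H, W)
    depth_map[v[valid].long(), u[valid].long()] = depth[valid]
    return depth_map

def depth_supervision_loss(depth_logits, lidar_depth, depth_bins):
    """Cross-entropy between the predicted per-pixel depth distribution
    depth_logits (D, H, W) and LiDAR depth discretized into the same D bins,
    evaluated only at pixels with a LiDAR return.

    depth_bins: sorted interior bin edges, length D - 1.
    """
    D, H, W = depth_logits.shape
    mask = lidar_depth > 0
    target = torch.bucketize(lidar_depth[mask], depth_bins)  # class in [0, D-1]
    logits = depth_logits.permute(1, 2, 0)[mask]             # (M, D)
    return F.cross_entropy(logits, target)
```

The "gated" aspect of GLS-BEV presumably controls how LiDAR-derived and image-derived depth features are mixed; a learned sigmoid gate is one common realization of such a mechanism, though the record does not specify the authors' design.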
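
The MACF module is described as three sub-attention modules with different roles interacting in a shared space; their individual designs are not detailed in this record. The sketch below shows only the generic bidirectional cross-attention pattern such a fusion block builds on, with hypothetical class and parameter names; it is a stand-in for the idea, not the paper's module.

```python
# Generic cross-modal attention fusion between camera and LiDAR BEV features.
# A hypothetical stand-in for the interaction pattern, not the paper's MACF.
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cam_from_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_from_cam = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, cam_bev, lidar_bev):
        """cam_bev, lidar_bev: (B, C, H, W) features on the same BEV grid,
        with C == dim (assumed)."""
        B, C, H, W = cam_bev.shape
        cam = cam_bev.flatten(2).transpose(1, 2)      # (B, HW, C)
        lidar = lidar_bev.flatten(2).transpose(1, 2)  # (B, HW, C)
        # Each modality queries the other, so information flows both ways.
        cam_enh, _ = self.cam_from_lidar(cam, lidar, lidar)
        lidar_enh, _ = self.lidar_from_cam(lidar, cam, cam)
        fused = self.out(torch.cat([cam_enh, lidar_enh], dim=-1))
        return fused.transpose(1, 2).reshape(B, C, H, W)
```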
id doaj-art-e7b4fed1b55749f39093c72a243f25ee
institution Kabale University
issn 2169-3536
doi 10.1109/ACCESS.2024.3518564
volume 13
pages 10447-10458
orcid Chao Jie Zuo: https://orcid.org/0009-0004-3997-9911
orcid Cao Yu Gu: https://orcid.org/0009-0000-0103-9562
orcid Xiao Dong Miao: https://orcid.org/0000-0002-5427-6550
affiliation School of Mechanical and Power Engineering, Nanjing Tech University, Nanjing, China (all four authors)