AlignFusionNet: Efficient Cross-Modal Alignment and Fusion for 3D Semantic Occupancy Prediction

The environmental perception system is a critical component of autonomous vehicles, and multimodal perception systems significantly enhance perception capabilities by integrating camera and LiDAR data. This paper proposes a novel framework, AlignFusionNet. It effectively combines image and point clo...

Bibliographic Details
Main Authors: Ziyi Xu, Legan Qi, Hongzhou Du, Jiaqi Yang, Zhenglin Chen
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
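Subjects: 3D occupancy prediction; point cloud; multi-view image; multimodal feature alignment; cross-attention mechanisms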
Online Access: https://ieeexplore.ieee.org/document/11082274/
author Ziyi Xu
Legan Qi
Hongzhou Du
Jiaqi Yang
Zhenglin Chen
collection DOAJ
description The environmental perception system is a critical component of autonomous vehicles, and multimodal perception systems significantly enhance perception capabilities by integrating camera and LiDAR data. This paper proposes a novel framework, AlignFusionNet. It effectively combines image and point cloud data to construct an occupancy network, thereby improving target detection and representation. The framework introduces two innovative modules: a point-level data alignment module based on geometric transformations and an enhanced fusion module utilizing cross-attention mechanisms. These modules achieve precise point-level alignment and seamless feature fusion between point clouds and RGB images. Experiments on the nuScenes-Occupancy dataset demonstrate that the proposed AlignFusionNet outperforms baseline methods, achieving a significant 15.9% improvement in mIoU and a 4% increase in IoU. Compared to the previous state-of-the-art method, OccGen, mIoU is improved by 5.9%. Further qualitative visualization analysis shows that the proposed method achieves higher representation accuracy for small objects.
format Article
id doaj-art-5d5ea1ab61a441aeaa45c3b9cf435201
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-5d5ea1ab61a441aeaa45c3b9cf435201
Record timestamp: 2025-08-20T03:13:42Z
Language: eng
Publisher: IEEE
Series: IEEE Access
ISSN: 2169-3536
Published: 2025-01-01
Volume 13, pp. 125003-125015
DOI: 10.1109/ACCESS.2025.3589858
Document number: 11082274
Title: AlignFusionNet: Efficient Cross-Modal Alignment and Fusion for 3D Semantic Occupancy Prediction
Authors and affiliations:
Ziyi Xu (https://orcid.org/0009-0008-1548-9642), Department of Electrical and Computer Engineering, University of Macau, Macau, China
Legan Qi, Department of Electrical and Computer Engineering, University of Macau, Macau, China
Hongzhou Du, Department of Electrical and Computer Engineering, University of Macau, Macau, China
Jiaqi Yang, Zhejiang Key Laboratory of Imaging and Interventional Medicine, Zhejiang Engineering Research Center of Interventional Medicine Engineering and Biotechnology, The Fifth Affiliated Hospital of Wenzhou Medical University, Lishui, China
Zhenglin Chen (https://orcid.org/0009-0007-9229-9693), Zhejiang Key Laboratory of Imaging and Interventional Medicine, Zhejiang Engineering Research Center of Interventional Medicine Engineering and Biotechnology, The Fifth Affiliated Hospital of Wenzhou Medical University, Lishui, China
Online access: https://ieeexplore.ieee.org/document/11082274/
Keywords: 3D occupancy prediction; point cloud; multi-view image; multimodal feature alignment; cross-attention mechanisms
title AlignFusionNet: Efficient Cross-Modal Alignment and Fusion for 3D Semantic Occupancy Prediction
topic 3D occupancy prediction
point cloud
multi-view image
multimodal feature alignment
cross-attention mechanisms
url https://ieeexplore.ieee.org/document/11082274/