Weighted Feature Fusion Network Based on Large Kernel Convolution and Transformer for Multi-Modal Remote Sensing Image Segmentation

Bibliographic Details
Main Authors: Jianxia Wang, Shaozu Qiu, Jia Cai, Xiaoming Zhang
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11123171/
Description
Summary: The heterogeneity and complexity of multi-modal data in high-resolution remote sensing images pose a severe challenge to existing cross-modal networks, which aim to fuse complementary information from high-resolution optical images and digital surface model (DSM) elevation data to achieve accurate semantic segmentation. To address this problem, a weighted feature fusion network based on large kernel convolution and Transformer (LTFCNet) is proposed. The model uses two parallel encoders to extract the features of the different modalities, an improved cross-fusion module to enhance the encoder’s feature extraction capability, and a gate module based on large kernel convolution and Transformer to achieve multi-modal fusion. Finally, a Difference information Feature Fusion Module (DFFM), which leverages attention to differential regions, is used to achieve cross-level feature fusion and enhance small object detection. To evaluate the network, we compare it with several state-of-the-art (SOTA) models on the Potsdam and Vaihingen datasets. The experimental results demonstrate that the proposed model outperforms other SOTA models by approximately 2% in the mIoU metric, validating its effectiveness in multi-modal feature fusion.
ISSN: 2169-3536
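
Note on the mIoU metric: the summary reports results in mIoU (mean intersection over union), the standard score for semantic segmentation. The sketch below shows how that score is commonly computed from flattened per-pixel labels. It is an illustration only and not the authors' evaluation code: the function name mean_iou, the choice to skip classes absent from both label lists, and the toy labels are all assumptions made for this example.

```python
from collections import Counter


def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over the classes that occur in the reference or the prediction.

    y_true, y_pred: equal-length sequences of integer class labels (one per pixel).
    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have equal length")

    inter = Counter()   # pixels where prediction and reference agree, per class
    n_true = Counter()  # reference pixels per class
    n_pred = Counter()  # predicted pixels per class
    for t, p in zip(y_true, y_pred):
        n_true[t] += 1
        n_pred[p] += 1
        if t == p:
            inter[t] += 1

    ious = []
    for c in range(num_classes):
        union = n_true[c] + n_pred[c] - inter[c]
        if union == 0:
            continue  # assumption: classes absent from both lists are skipped
        ious.append(inter[c] / union)

    if not ious:
        raise ValueError("no class from range(num_classes) is present")
    return sum(ious) / len(ious)


# Toy labels (made up for this example): six pixels, three classes.
reference = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
score = mean_iou(reference, predicted, num_classes=3)
print(f"mIoU = {score:.3f}")
```

For the toy labels, the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. Published benchmarks differ in details such as which classes are excluded from the average, so scores are comparable only under the same protocol.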