Combining convolutional neural network with transformer to improve YOLOv7 for gas plume detection and segmentation in multibeam water column images

Bibliographic Details
Main Authors: Wenguang Chen, Xiao Wang, Junjie Chen, Jialong Sun, Guozhen Zha
Format: Article
Language: English
Published: PeerJ Inc., 2025-05-01
Series: PeerJ Computer Science
Subjects:
Online Access: https://peerj.com/articles/cs-2923.pdf
Description
Summary: Multibeam bathymetry has become an effective underwater target detection method, using echo signals to generate high-resolution water column images (WCIs). However, gas plumes in these images are often affected by the seafloor environment and exhibit sparse texture and changing motion, making traditional detection and segmentation methods time-consuming and labor-intensive. The emergence of convolutional neural networks (CNNs) alleviates this problem, but convolutional operations extract only local features: while they capture fine detail well, they cannot adapt to the elongated morphology of gas plume targets, which limits further gains in detection and segmentation accuracy. Inspired by the transformer's ability to achieve global modeling through self-attention, we combine a CNN with a transformer to improve the existing YOLOv7 (You Only Look Once version 7) model. First, we progressively reduce the ELAN (Efficient Layer Aggregation Networks) structure in the backbone network and verify that applying the enhanced feature extraction module only in the deep layers is more effective for recognizing gas plume targets. Then, the C-BiFormer module is proposed, which achieves effective collaboration between local feature extraction and global semantic modeling while reducing computational cost, and enhances the model's multi-scale feature extraction capability. Finally, two networks of different depths are designed by stacking different numbers of C-BiFormer modules; the enlarged receptive field yields corresponding gains in detection and segmentation accuracy. Experimental results show that the improved model is smaller and more accurate than the baseline.
ISSN:2376-5992
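
The abstract's core idea of pairing a convolutional local branch with a self-attention global branch can be illustrated with a minimal numpy sketch. This is a conceptual toy, not the authors' C-BiFormer: the local branch here is a simple neighborhood-averaging stand-in for convolution, and the global branch is plain single-head self-attention with hypothetical weight matrices `wq`, `wk`, `wv`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_branch(x, kernel_size=3):
    # Local mixing: average each token with its neighbours along the
    # sequence axis, a crude stand-in for a convolution's local receptive field.
    n, _ = x.shape
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + kernel_size].mean(axis=0) for i in range(n)])

def global_branch(x, wq, wk, wv):
    # Single-head self-attention: every token attends to all others,
    # providing the global context that helps with elongated targets.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def hybrid_block(x, wq, wk, wv):
    # Fuse local detail and global context with a residual sum.
    return x + local_branch(x) + global_branch(x, wq, wk, wv)

# Example: 8 tokens with 16-dimensional features.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
wq, wk, wv = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
y = hybrid_block(x, wq, wk, wv)  # same shape as x: (8, 16)
```

The residual sum keeps the token shape unchanged, so such blocks can be stacked to deepen the network, mirroring how the paper stacks different numbers of C-BiFormer modules to enlarge the receptive field.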