Video Instance Segmentation Through Hierarchical Offset Compensation and Temporal Memory Update for UAV Aerial Images

Bibliographic Details
Main Authors: Ying Huang, Yinhui Zhang, Zifen He, Yunnan Deng
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Sensors
Online Access: https://www.mdpi.com/1424-8220/25/14/4274
Description
Summary: Despite the pivotal role of unmanned aerial vehicles (UAVs) in intelligent inspection tasks, existing video instance segmentation (VIS) methods struggle with irregular, deforming targets: they capture feature offsets ineffectively and model temporal correlation poorly, which leads to inconsistent segmentation results. To address this issue, we propose a hierarchical offset compensation and temporal memory update method for video instance segmentation (HT-VIS) with high generalization ability. First, a hierarchical offset compensation (HOC) module, built from sequential and parallel connections, applies deformable offsets to the same flexible target across frames, compensating for spatial motion features along the time sequence. Next, a temporal memory update (TMU) module applies convolutional long short-term memory (ConvLSTM) between the current and adjacent frames to establish temporal dynamic context correlation and update the current frame feature. Finally, extensive experiments demonstrate the superiority of the proposed HT-VIS method on the public YouTubeVIS-2019 dataset and a self-built UAV-Seg segmentation dataset. On four typical datasets (Zoo, Street, Vehicle, and Sport) extracted from YouTubeVIS-2019 according to category characteristics, the proposed HT-VIS outperforms the state-of-the-art CNN-based VIS method CrossVIS by 3.9%, 2.0%, 0.3%, and 3.8% in average segmentation accuracy, respectively. On the self-built UAV-VIS dataset, HT-VIS with PHOC surpasses the baseline SipMask by 2.1% and achieves the highest average segmentation accuracy among the CNN-based methods, 37.4%, demonstrating the effectiveness and robustness of the proposed framework.
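The temporal memory update idea in the summary can be illustrated with a small ConvLSTM sketch. This is not the authors' implementation: it assumes a standard ConvLSTM cell (gates computed by one convolution over the concatenated input and hidden state), random untrained weights, and NumPy only. The names `conv2d_same`, `ConvLSTMCell`, and `update_current_feature`, as well as the channel counts and kernel size, are hypothetical choices made for illustration.

```python
import numpy as np


def conv2d_same(x, w):
    """Zero-padded 'same' 2D cross-correlation.

    x: (C_in, H, W) feature map, w: (C_out, C_in, k, k) kernel with odd k.
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + wd]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class ConvLSTMCell:
    """Standard ConvLSTM cell (assumed form, not taken from the paper)."""

    def __init__(self, in_ch, hid_ch, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.hid_ch = hid_ch
        # One kernel yields all four gates from the concatenation [x, h].
        self.w = rng.normal(0.0, 0.1, size=(4 * hid_ch, in_ch + hid_ch, k, k))
        self.b = np.zeros((4 * hid_ch, 1, 1))

    def step(self, x, h, c):
        gates = conv2d_same(np.concatenate([x, h], axis=0), self.w) + self.b
        i, f, o, g = np.split(gates, 4, axis=0)
        c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # memory update
        h_next = sigmoid(o) * np.tanh(c_next)              # new hidden state
        return h_next, c_next


def update_current_feature(adjacent_feat, current_feat, cell):
    """Feed the adjacent frame, then the current frame, through the cell.

    The final hidden state serves as the temporally updated current feature.
    """
    _, h, w = current_feat.shape
    hidden = np.zeros((cell.hid_ch, h, w))
    memory = np.zeros_like(hidden)
    for feat in (adjacent_feat, current_feat):
        hidden, memory = cell.step(feat, hidden, memory)
    return hidden


# Usage with made-up feature maps: 8 channels on a 16x16 grid.
rng = np.random.default_rng(1)
adjacent = rng.normal(size=(8, 16, 16))
current = rng.normal(size=(8, 16, 16))
cell = ConvLSTMCell(in_ch=8, hid_ch=8)
updated = update_current_feature(adjacent, current, cell)
print(updated.shape)  # (8, 16, 16)
```

Processing the adjacent frame first lets the cell's memory carry context into the step that handles the current frame, which is one simple way to realize the "update the current frame feature" behavior the summary describes.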
ISSN:1424-8220