Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation

Universal image segmentation aims to handle all segmentation tasks within a single model architecture and ideally requires only one training phase. To achieve task-conditioned joint training, a task token needs to be used in the multi-task training to condition the model for specific tasks. Existing...

Bibliographic Details
Main Authors: Yipeng Qu, Joohee Kim
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Sensors
Subjects: computer vision; universal image segmentation; multimodal learning; feature alignment
Online Access: https://www.mdpi.com/1424-8220/25/2/359
_version_ 1832587541736325120
author Yipeng Qu
Joohee Kim
author_facet Yipeng Qu
Joohee Kim
author_sort Yipeng Qu
collection DOAJ
description Universal image segmentation aims to handle all segmentation tasks within a single model architecture and ideally requires only one training phase. To achieve task-conditioned joint training, a task token needs to be used in the multi-task training to condition the model for specific tasks. Existing approaches generate the task token from a text input (e.g., “the task is panoptic”). However, such text-based inputs merely serve as labels and fail to capture the inherent differences between tasks, potentially misleading the model. In addition, the discrepancy between visual and textual modalities limits the performance gains of existing text-involved segmentation models. Prevailing modality-alignment methods, however, rely on large-scale uni-modal encoders for both modalities and an extremely large amount of paired training data, which makes them hard to apply to lightweight segmentation models and resource-constrained devices. In this paper, we propose Adaptive Feature Alignment (AFA) integrated with a learnable task token to address these issues. The learnable task token automatically captures inter-task differences from both image features and text queries during training, providing a more effective and efficient solution than a predefined text-based token. To align the two modalities efficiently without introducing extra complexity, we reconsider the differences between a text token and an image token and replace image features with class-specific means in our proposed AFA. We evaluate our model on the ADE20K and Cityscapes datasets. Experimental results demonstrate that our model surpasses baseline models in both efficiency and effectiveness, achieving state-of-the-art performance among segmentation models with a comparable number of parameters.
format Article
id doaj-art-fcc41f6be3774bcb81090bd288d5ab9d
institution Kabale University
issn 1424-8220
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj-art-fcc41f6be3774bcb81090bd288d5ab9d
2025-01-24T13:48:39Z
eng
MDPI AG
Sensors
1424-8220
2025-01-01
25(2), 359
10.3390/s25020359
Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation
Yipeng Qu, Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA
Joohee Kim, Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA
https://www.mdpi.com/1424-8220/25/2/359
computer vision
universal image segmentation
multimodal learning
feature alignment
spellingShingle Yipeng Qu
Joohee Kim
Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation
Sensors
computer vision
universal image segmentation
multimodal learning
feature alignment
title Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation
title_full Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation
title_fullStr Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation
title_full_unstemmed Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation
title_short Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation
title_sort efficient multi task training with adaptive feature alignment for universal image segmentation
topic computer vision
universal image segmentation
multimodal learning
feature alignment
url https://www.mdpi.com/1424-8220/25/2/359
work_keys_str_mv AT yipengqu efficientmultitasktrainingwithadaptivefeaturealignmentforuniversalimagesegmentation
AT jooheekim efficientmultitasktrainingwithadaptivefeaturealignmentforuniversalimagesegmentation