Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation

Universal image segmentation aims to handle all segmentation tasks within a single model architecture and ideally requires only one training phase. To achieve task-conditioned joint training, a task token needs to be used in the multi-task training to condition the model for specific tasks. Existing...

Bibliographic Details
Main Authors:	Yipeng Qu, Joohee Kim
Format:	Article
Language:	English
Published:	MDPI AG 2025-01-01
Series:	Sensors
Subjects:	computer vision universal image segmentation multimodal learning feature alignment
Online Access:	https://www.mdpi.com/1424-8220/25/2/359

author	Yipeng Qu Joohee Kim
collection	DOAJ
description	Universal image segmentation aims to handle all segmentation tasks within a single model architecture and ideally requires only one training phase. To achieve task-conditioned joint training, multi-task training needs a task token that conditions the model on the specific task. Existing approaches generate the task token from a text input (e.g., “the task is panoptic”). However, such text-based inputs merely serve as labels and fail to capture the inherent differences between tasks, potentially misleading the model. In addition, the discrepancy between visual and textual modalities limits the performance gains in existing text-involved segmentation models. Prevailing modality-alignment methods, however, rely on large-scale uni-modal encoders for both modalities and an extremely large amount of paired training data, which makes them hard to apply to lightweight segmentation models and resource-constrained devices. In this paper, we propose Adaptive Feature Alignment (AFA) integrated with a learnable task token to address these issues. The learnable task token automatically captures inter-task differences from both image features and text queries during training, providing a more effective and efficient solution than a predefined text-based token. To efficiently align the two modalities without introducing extra complexity, we reconsider the differences between a text token and an image token and replace image features with class-specific means in our proposed AFA. We evaluate our model on the ADE20K and Cityscapes datasets. Experimental results demonstrate that our model surpasses baseline models in both efficiency and effectiveness, achieving state-of-the-art performance among segmentation models with a comparable number of parameters.
format	Article
id	doaj-art-fcc41f6be3774bcb81090bd288d5ab9d
institution	Kabale University
issn	1424-8220
language	English
publishDate	2025-01-01
publisher	MDPI AG
record_format	Article
series	Sensors
spelling	doaj-art-fcc41f6be3774bcb81090bd288d5ab9d; 2025-01-24T13:48:39Z; eng; MDPI AG; Sensors; 1424-8220; 2025-01-01; vol. 25, no. 2, art. 359; 10.3390/s25020359; Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation; Yipeng Qu, Joohee Kim (both: Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA); https://www.mdpi.com/1424-8220/25/2/359; computer vision; universal image segmentation; multimodal learning; feature alignment
title	Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation
topic	computer vision universal image segmentation multimodal learning feature alignment
url	https://www.mdpi.com/1424-8220/25/2/359

Similar Items