Multi-Task Supervised Alignment Pre-Training for Few-Shot Multimodal Sentiment Analysis

Few-shot multimodal sentiment analysis (FMSA) has garnered substantial attention due to the proliferation of multimedia applications, especially given the frequent difficulty of obtaining large quantities of training samples. Previous works have directly incorporated the vision modality into a pre-trained language model (PLM) and then leveraged prompt learning, showing effectiveness in few-shot scenarios. However, these methods struggle to align the high-level semantics of different modalities because of their inherent heterogeneity, which hurts sentiment analysis performance. In this paper, we propose a novel framework called Multi-task Supervised Alignment Pre-training (MSAP) to enhance modality alignment and consequently improve the performance of multimodal sentiment analysis. Our approach uses a multi-task training method (image classification, image style recognition, and image captioning) to extract modality-shared information and stronger semantics that improve the visual representation. We employ task-specific prompts to unify these diverse objectives into a single masked language model (MLM) objective, which serves as the foundation of the MSAP framework and strengthens the alignment of the visual and textual modalities. Extensive experiments on three datasets demonstrate that our method achieves a new state of the art for the FMSA task.
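
For illustration only, the following minimal Python sketch shows one way task-specific prompts could cast the three auxiliary supervision tasks into a single MLM fill-in objective. It is not the authors' implementation; the templates, the image placeholder token, and the helper names are assumptions made for this example.

# Hypothetical sketch: unify image classification, style recognition, and
# captioning supervision as cloze-style prompts for a single MLM head.
# Templates, the <img_feats> placeholder, and all names are illustrative assumptions.

from dataclasses import dataclass

MASK = "[MASK]"        # mask token the MLM is trained to fill in
IMG = "<img_feats>"    # placeholder where projected visual features would be inserted

# One cloze template per auxiliary task, so every task shares the same MLM objective.
TEMPLATES = {
    "classification": f"{IMG} The image shows a {MASK} .",
    "style":          f"{IMG} The style of this image is {MASK} .",
    "caption":        f"{IMG} A short description of the image: {MASK} .",
}

@dataclass
class PromptedExample:
    task: str
    prompt: str   # text fed to the MLM, with the label position masked
    target: str   # token(s) the MLM should predict at the mask

def build_example(task: str, label: str) -> PromptedExample:
    """Wrap a (task, label) supervision pair into a unified MLM example."""
    return PromptedExample(task=task, prompt=TEMPLATES[task], target=label)

if __name__ == "__main__":
    batch = [
        build_example("classification", "dog"),
        build_example("style", "cartoon"),
        build_example("caption", "a dog running on grass"),
    ]
    for ex in batch:
        print(f"[{ex.task:>14}] {ex.prompt}  ->  predict: {ex.target}")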

Bibliographic Details
Main Authors: Junyang Yang, Jiuxin Cao, Chengge Duan
Author Affiliations: School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
Format: Article
Language: English
Published: MDPI AG, 2025-02-01
Series: Applied Sciences, Vol. 15, Issue 4, Article 2095
DOI: 10.3390/app15042095
ISSN: 2076-3417
Subjects: few-shot multimodal sentiment analysis; multi-task learning; supervised pre-training
Online Access: https://www.mdpi.com/2076-3417/15/4/2095